Project 5

Project Description Part A (The Power of Diffusion Models)

In part A of this project, I experiment with the pre-trained DeepFloyd diffusion model, implementing diffusion sampling loops and using them for tasks such as inpainting and creating optical illusions. DeepFloyd was trained as a text-to-image model: it takes text prompts as input and outputs images aligned with the text.

The instructions for part A can be found here.

PartA 0 Sampling from the Model

To start off, the pre-trained DeepFloyd diffusion model consists of two stages. The first stage produces images of size 64 × 64, and the second stage upsamples the outputs of the first stage to images of size 256 × 256.

In each stage of the pre-trained model, I can set num_inference_steps, which controls the number of denoising iterations. Below are sample outputs for different num_inference_steps values. The quality of the output images clearly improves as num_inference_steps increases, most notably in two aspects: the diversity of colors and the level of detail. For example, with the text prompt “an oil painting of a snowy mountain village”, the colors of the outputs become significantly more diverse and vivid, which better matches the look of an oil painting. With the text prompt “a rocket ship”, the clouds look much more realistic at num_inference_steps = 100 than at num_inference_steps = 10.

I set the random seed to 180.
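For reference, here is a minimal sketch of how the two stages can be called through the Hugging Face diffusers library. The model IDs and call pattern follow the public DeepFloyd IF pipelines and are assumptions about the setup, not necessarily the exact code used for the results below.

```python
import torch
from diffusers import DiffusionPipeline

# Assumed model IDs for the public DeepFloyd IF checkpoints.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16)

prompt = "an oil painting of a snowy mountain village"
generator = torch.manual_seed(180)  # seed used in this report

# Stage 1: text -> 64x64 image; stage 2: 64x64 -> 256x256 super-resolution.
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
image_64 = stage_1(prompt_embeds=prompt_embeds,
                   negative_prompt_embeds=negative_embeds,
                   num_inference_steps=50,
                   generator=generator,
                   output_type="pt").images
image_256 = stage_2(image=image_64,
                    prompt_embeds=prompt_embeds,
                    negative_prompt_embeds=negative_embeds,
                    num_inference_steps=50,
                    generator=generator).images[0]
```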


(stage1, stage2) = (10, 10)

an oil painting of a snowy mountain village

a man wearing a hat

a rocket ship


(stage1, stage2) = (50, 50)

an oil painting of a snowy mountain village

a man wearing a hat

a rocket ship


(stage1, stage2) = (100, 100)

an oil painting of a snowy mountain village

a man wearing a hat

a rocket ship


PartA 1.1 Implementing the Forward Process

In order to experiment with the pre-trained denoising diffusion model, we first need a process that adds noise to clean images, called the forward process. The key parameter controlling how much noise to add is t, an integer between 0 and 999. The larger the t value, the closer the image is to pure noise.
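Concretely, the forward process scales the clean image and mixes in Gaussian noise according to the cumulative noise schedule. A minimal sketch of the standard DDPM forward process (alphas_cumprod is assumed to come from the DeepFloyd scheduler):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise to a clean image im for timestep t.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
    """
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return alpha_bar_t.sqrt() * im + (1 - alpha_bar_t).sqrt() * eps
```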

Below are three noisy images produced by the forward process, with t = 250, 500, and 750 respectively.


Campanile Original

Noisy Campanile (t = 250)

Noisy Campanile (t = 500)

Noisy Campanile (t = 750)


PartA 1.2 Classical Denoising

In this part, we try to denoise the three noisy images from the previous part using classical Gaussian blur filtering. Despite trying different kernel sizes and sigma values, I was unable to obtain good denoising results.
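For reference, a sketch of the kind of Gaussian blur filtering tried here, using torchvision (the kernel size and sigma shown are examples, not the exact values I tried):

```python
import torch
import torchvision.transforms.functional as TF

noisy_im = torch.rand(3, 64, 64)  # stand-in for one of the noisy Campanile images
# Larger sigma removes more noise but also destroys more detail.
blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)
```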


Gaussian Blur Denoising

Denoised Campanile (t = 250)

Denoised Campanile (t = 500)

Denoised Campanile (t = 750)


PartA 1.3 One-Step Denoising

Rather than the iterative denoising process that is essential to diffusion models, in this part we use one-step denoising. We estimate the noise in the image using the pre-trained model, then invert the forward process to recover an estimate of the clean image. As shown below, one-step denoising performs much better than classical Gaussian blur denoising.
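A minimal sketch of this inversion. Here noise_estimator is assumed to wrap the stage-1 DeepFloyd UNet call (which in practice also takes prompt embeddings and whose raw output includes predicted-variance channels that I omit), and alphas_cumprod comes from the scheduler as in partA 1.1:

```python
import torch

def one_step_denoise(x_t, t, noise_estimator, alphas_cumprod):
    """Recover an estimate of the clean image x_0 from a noisy image x_t
    in a single step, given a model that predicts the added noise."""
    with torch.no_grad():
        eps_hat = noise_estimator(x_t, t)  # assumed to return eps with x_t's shape
    alpha_bar_t = alphas_cumprod[t]
    # Invert x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    return (x_t - (1 - alpha_bar_t).sqrt() * eps_hat) / alpha_bar_t.sqrt()
```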


One-Step Denoising

Denoised Campanile (t = 250)

Denoised Campanile (t = 500)

Denoised Campanile (t = 750)


PartA 1.4 Iterative Denoising

In this part it is finally time to implement iterative denoising, using the same pre-trained noise estimator as in one-step denoising. Rather than denoising at every t value between 0 and 999, we take a shortcut and denoise at strided timesteps, stepping t down in intervals of 30 (as recommended by the instructor).
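A sketch of one strided denoising update between consecutive strided timesteps t and t' < t, following the standard DDPM posterior mean. The variable names are mine, x0_hat is the one-step clean-image estimate from the current noise prediction, and the variance term is simplified to scaled Gaussian noise:

```python
def strided_ddpm_step(x_t, x0_hat, t, t_prime, alphas_cumprod, noise):
    """One denoising update from timestep t to an earlier timestep t_prime,
    using the current one-step estimate x0_hat of the clean image."""
    alpha_bar_t = alphas_cumprod[t]
    alpha_bar_tp = alphas_cumprod[t_prime]
    alpha_t = alpha_bar_t / alpha_bar_tp          # effective alpha over the stride
    beta_t = 1 - alpha_t
    mean = (alpha_bar_tp.sqrt() * beta_t / (1 - alpha_bar_t)) * x0_hat \
         + (alpha_t.sqrt() * (1 - alpha_bar_tp) / (1 - alpha_bar_t)) * x_t
    # Variance term, simplified here as scaled Gaussian noise (omit at the last step).
    return mean + beta_t.sqrt() * noise
```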

The example below first adds noise to the clean original Campanile image up to t = 690, then iteratively denoises it in steps of 30. Every 150 steps of t, I also show the intermediate noisy Campanile image.

Furthermore, we can compare the denoising performance of the classical Gaussian blur, one-step, and iterative approaches. The Gaussian method is easily the worst, the one-step result lacks detail, and the iterative method is the clear winner of the three.


Iterative Denoising from t = 690

Noisy Campanile
(t = 90)

Noisy Campanile
(t = 240)

Noisy Campanile
(t = 390)

Noisy Campanile
(t = 540)

Noisy Campanile
(t = 690)


Campanile
Original

Denoised Campanile
(Iterative Approach)

Denoised Campanile
(One-Step Approach)

Denoised Campanile
(Gaussian Blur Approach)


PartA 1.5 Diffusion Model Sampling

Changing our mindset a bit: if we start from an image of pure noise and denoise it, we can use the same iterative denoising mechanism to generate completely new images from a text prompt. Below are five sample outputs. The input text prompt is “a high quality photo”.
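A minimal sketch of this idea. The iterative_denoise routine and its interface are hypothetical stand-ins for the loop built in partA 1.4:

```python
import torch

# Start from pure noise and run the iterative denoiser from the noisiest
# strided timestep, so the model "invents" the entire image.
# iterative_denoise is the partA 1.4 routine (hypothetical interface).
x_T = torch.randn(1, 3, 64, 64)
sample = iterative_denoise(x_T, i_start=0, prompt="a high quality photo")
```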



PartA 1.6 Classifier-Free Guidance (CFG)

Although the samples from the previous part are genuinely new images, their quality is not high (they lack color and detail), which suggests we need to change the mechanism so that the model follows the input text prompt more closely. This change is called classifier-free guidance (CFG).

In CFG, instead of a single text prompt, we input two: the first is our actual text prompt, and the second is an empty string (the unconditional prompt). In each iteration we estimate noise under both prompts and denoise using a combination of the two estimates that pushes the result toward the conditional prediction.
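The combination is the standard classifier-free guidance formula; a sketch (the guidance scale value shown is illustrative, not necessarily the one used here):

```python
def cfg_noise_estimate(eps_cond, eps_uncond, gamma=7.0):
    """Classifier-free guidance: move past the unconditional estimate in the
    direction of the conditional one. gamma > 1 strengthens prompt adherence."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```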

Below are five sample outputs using CFG, which indeed have higher quality than the samples from the previous part.



PartA 1.7 Image-to-image Translation

In partA 1.4, we take a real image, add noise to it, and then denoise. This effectively allows us to make edits to existing images. The more noise we add, the larger the edit will be. This works because in order to denoise an image, the diffusion model must to some extent "hallucinate" new things -- the model has to be "creative." Another way to think about it is that the denoising process "forces" a noisy image back onto the manifold of natural images.

In this part, we add different amounts of noise to a clean image and then denoise. We expect that the less noise we add, the closer the denoised output looks to the original clean image.

Here, i_start is an index into the strided timestep schedule and plays the same role as the parameter t explained in partA 1.1. Larger i_start values correspond to lower t values, i.e. less noise added to the clean input image.
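A minimal sketch of the procedure (this is the SDEdit idea); forward and iterative_denoise are the routines from partA 1.1 and 1.4, shown here with simplified signatures:

```python
# Noise the clean image up to the timestep indexed by i_start, then let the
# iterative denoiser pull it back onto the manifold of natural images.
# Smaller i_start -> starts at a noisier timestep -> larger edit.
def image_to_image(clean_im, i_start, strided_timesteps, prompt):
    t = strided_timesteps[i_start]
    noisy = forward(clean_im, t)                      # partA 1.1, simplified signature
    return iterative_denoise(noisy, i_start, prompt)  # partA 1.4, simplified signature
```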


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Campanile Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Elephant Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 20

Plane Original


PartA 1.7.1 Editing Hand-Drawn and Web Images

This part is a real-world application of partA 1.7: we want to generate a new image that still follows a certain structure from an input image. Depending on how different we want the output to be from the input, we choose how much noise to add before the iterative denoising process.


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Avocado Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Hand-Drawn Car Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Hand-Drawn Suncloud Original


PartA 1.7.2 Inpainting

Another fun application of generating a new image while preserving parts of the input image is inpainting: generating new content only inside the masked region of the input image while keeping everything outside the mask intact.
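The key trick is that after each denoising step we keep the generated content only inside the mask and re-impose a freshly noised copy of the original image everywhere else. A sketch, assuming the forward process from partA 1.1 and a binary mask m that is 1 in the region to be regenerated:

```python
def inpaint_step(x_t, original_im, m, t):
    """Force pixels outside the mask back to the original image (noised to the
    current timestep t), so only the masked region is actually generated."""
    return m * x_t + (1 - m) * forward(original_im, t)
```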


Campanile Original

Campanile Mask

Campanile Inpainted


Dog Original

Dog Mask

Dog Inpainted


Painting Original

Painting Mask

Painting Inpainted


PartA 1.7.3 Text-Conditional Image-to-image Translation

This part combines the previous part with partA 1.7.1: we add different amounts of noise to the masked region of the input image before denoising. Again, larger i_start values correspond to smaller t values, so we expect the outputs to look closer to the original input image.


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Campanile Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Dog Original


i_start = 1

i_start = 3

i_start = 5

i_start = 7

i_start = 10

i_start = 15

i_start = 20

i_start = 25

i_start = 30

Painting Original


PartA 1.8 Visual Anagrams

In this part we make small changes to our iterative denoising process to create a type of optical illusion: visual anagrams, where a single image tells two different stories depending on whether it is viewed upright or flipped upside down.

The small change is that we now input three text prompts: two conditional prompts, each telling one of the stories, and the unconditional empty string, which improves output quality via CFG. In each iteration, we use the pre-trained model to estimate noise once for the image with the first conditional prompt and once for the flipped image with the second conditional prompt (flipping that second estimate back afterward). We average the two noise estimates and use the average in the denoising step.
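A sketch of the combined noise estimate; cfg_noise is an assumed helper that returns a CFG-guided noise estimate for a given image, timestep, and prompt:

```python
import torch

def anagram_noise_estimate(x_t, t, prompt_upright, prompt_flipped, cfg_noise):
    """Average a noise estimate for the upright image with one computed on the
    vertically flipped image (flipped back), so both prompts shape the result."""
    eps_1 = cfg_noise(x_t, t, prompt_upright)
    eps_2 = torch.flip(cfg_noise(torch.flip(x_t, dims=[-2]), t, prompt_flipped),
                       dims=[-2])  # flip over the height axis and back
    return (eps_1 + eps_2) / 2
```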

Check this out! It’s super cool!


an oil painting of an old man

an oil painting of people around a campfire


a photo of a man

a photo of a dog


a lithograph of waterfalls

an oil painting of a snowy mountain village


PartA 1.9 Hybrid Images

In this part we make small changes to our iterative denoising process to create another type of optical illusion: hybrid images, where a single image looks like two different images when viewed up close versus from far away. This is basically the same as project 2 part 2.2, except that now we generate new hybrid images using the pre-trained model.

The small change is that in each iteration we estimate noise under two conditional text prompts. We keep only the high frequencies of the noise estimate for the image we want to see up close and the low frequencies of the noise estimate for the image we want to see from far away, and then add the two filtered estimates together.
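A sketch of the combination, using a Gaussian blur as the low-pass filter (the kernel size and sigma are illustrative, and cfg_noise is the same assumed helper as above):

```python
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, prompt_far, prompt_near, cfg_noise):
    """Low frequencies follow prompt_far (seen from afar); high frequencies
    follow prompt_near (seen up close)."""
    eps_far = cfg_noise(x_t, t, prompt_far)
    eps_near = cfg_noise(x_t, t, prompt_near)
    low = TF.gaussian_blur(eps_far, kernel_size=33, sigma=2.0)    # low-pass
    high = eps_near - TF.gaussian_blur(eps_near, kernel_size=33, sigma=2.0)
    return low + high
```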


low frequency: a lithograph of a skull
high frequency: a photo of a dog

low frequency: a pencil
high frequency: a rocket ship

low frequency: a lithograph of a skull
high frequency: a lithograph of waterfalls


Project Description Part B (Diffusion Models from Scratch)

In part B of this project, I train my own diffusion model on the MNIST dataset, which contains images of handwritten digits (ten classes). There are 60k training images and 10k test images.

In partB 1, I implemented a single-step denoising network, similar to partA 1.3. In partB 2, I added time-conditioning and class-conditioning to the single-step denoising UNet.

The instructions for part B can be found here.

PartB 1 Training a Single-Step Denoising U-Net

In order to train a single-step denoising UNet, I followed the architecture instructions step by step. First, I wrote the add_noise function, in which sigma controls how much noise is added to an input image; larger sigma values give noisier images. Second, I implemented the simple and composed operations, as well as the UnconditionalUNet and train functions. Third, I trained the model for 5 epochs with a batch size of 256. Training took about 7 minutes on a T4 GPU on Google Colab.
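A minimal sketch of the noising operation and the denoising objective used here; the variable names are mine, and the loss shown is the L2 reconstruction loss between the UNet output and the clean image:

```python
import torch
import torch.nn.functional as F

def add_noise(x, sigma):
    """z = x + sigma * eps with eps ~ N(0, I); larger sigma -> noisier image."""
    return x + sigma * torch.randn_like(x)

# One training step (sketch): the unconditional UNet maps a noisy image back
# to the clean image and is trained with an L2 loss; sigma = 0.5 during training.
# x: a clean MNIST batch, unet: the UnconditionalUNet.
#   loss = F.mse_loss(unet(add_noise(x, 0.5)), x)
```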




Checkpoint Model Output 1st Epoch

Checkpoint Model Output 5th Epoch



Comparing model performance after epoch 1 and epoch 5, the epoch 1 model produces somewhat grayish outputs, whereas the epoch 5 model produces outputs much closer to the clean images.

Since the noisy training images all use sigma = 0.5, we also want to know how well the model handles other noise levels. As shown in the example above, the model denoises reasonably well for sigma values up to about 0.8.

PartB 2 Training a Diffusion Model (Time-Conditioning)

In this part, I made some minor changes to the single-step denoising UNet. In particular, I injected an embedding of the denoising timestep t at two places in the decoding layers of the architecture. In addition, while the single-step UNet outputs the clean image recovered from the noisy input, the time-conditional UNet outputs an estimate of the noise that was added in the forward process. Lastly, this part is similar to partA 1.4 in that we now iteratively denoise an image rather than denoising in a single step.
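A sketch of how the timestep can be injected, consistent with the description above. The layer names, the use of a small MLP, and the normalization of t by the total number of timesteps T are assumptions about the implementation details; the loss is the standard noise-prediction objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FCBlock(nn.Module):
    """Small MLP that maps a normalized timestep t/T to a per-channel bias
    that is broadcast-added onto a decoder feature map."""
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, out_channels), nn.GELU(),
                                 nn.Linear(out_channels, out_channels))

    def forward(self, t):                      # t: shape (B, 1), already t/T
        return self.net(t)[:, :, None, None]   # shape (B, C, 1, 1)

# Inside the time-conditional UNet forward pass (sketch):
#   x = self.unflatten(bottleneck) + self.fc1_t(t)              # first injection
#   x = self.up1(torch.cat([x, skip], dim=1)) + self.fc2_t(t)   # second injection
#
# Training objective: predict the noise added by the forward process.
#   x_t = alpha_bar[t].sqrt() * x_0 + (1 - alpha_bar[t]).sqrt() * eps
#   loss = F.mse_loss(unet(x_t, t / T), eps)
```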

I trained the model for 20 epochs with a batch size of 128. Training took about 15 minutes on a T4 GPU on Google Colab.



Checkpoint Model Output 5th Epoch


Checkpoint Model Output 20th Epoch


After training the time-conditional UNet, I sampled 50 images of pure noise and had the model denoise them. Comparing the epoch 5 and epoch 20 checkpoints, the epoch 20 model clearly outperforms the epoch 5 model. While the epoch 5 model generates completely unrecognizable digits roughly half of the time, the epoch 20 model generates recognizable digits almost all the time, with only a few defects such as in how the digit 0 closes its loop.

For unknown reasons, my randomly sampled outputs are mostly the digits 6, 8, and 0. I am unsure whether this is an artifact of what the model learned.

PartB 2 Training a Diffusion Model (Time-and-Class-Conditioning)

This part is similar to partA 1.6, in that I add classifier-free guidance to the time-conditional UNet. In terms of architecture, I add an embedding of the class label (one-hot encoded) at the same decoder layers where the time-conditioning is added. During sampling, the noise estimate at each denoising step is a CFG combination of the class-conditioned and unconditional estimates.
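A sketch of the two key pieces: dropping the class label during training so the model also learns an unconditional estimate, and the CFG combination at sampling time. The dropout probability and guidance scale shown are illustrative, not necessarily the exact values used here:

```python
import torch

def maybe_drop_class(one_hot, p_uncond=0.1):
    """With probability p_uncond, zero out the one-hot class vector so the
    model also learns an unconditional (class-free) noise estimate."""
    keep = (torch.rand(one_hot.shape[0], 1) > p_uncond).float()
    return one_hot * keep

def cfg_combine(eps_cond, eps_uncond, gamma=5.0):
    """Classifier-free guidance applied at each sampling step."""
    return eps_uncond + gamma * (eps_cond - eps_uncond)
```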

Exactly as for the time-conditional model, I trained this model for 20 epochs with a batch size of 128. Training took about 15 minutes on a T4 GPU on Google Colab.



Checkpoint Model Output 5th Epoch


Checkpoint Model Output 20th Epoch


Above are sample outputs from the epoch 5 and epoch 20 checkpoints respectively, with five outputs generated for each class. As we can see, the two checkpoints do not differ much in performance. I think this is because the training loss is already quite low by epoch 5 (around the 2000th batch step), close to where the epoch 20 model ends up. If anything, the epoch 5 model may even do better at generating the digit 5, while the epoch 20 outputs contain less stray noise outside the digit pixels.